A meta-analysis was conducted on the accuracy of predictions of various objective outcomes in the 
areas of social and clinical psychology from short observations of expressive behavior (under 5 
min). The overall effect size (r) for the accuracy of predictions for 38 different results was .39. 
Studies using longer periods of behavioral observation did not yield greater predictive accuracy; 
predictions based on observations under ½ min in length did not differ significantly from predictions 
based on 4- and 5-min observations. The type of behavioral channel (such as the face, speech, 
the body, tone of voice) on which the ratings were based was not related to the accuracy of predic- 
tions. Accuracy did not vary significantly between behaviors manipulated in a laboratory and more 
naturally occurring behavior. Last, effect sizes did not differ significantly for predictions in the 
areas of clinical psychology, social psychology, and the accuracy of detecting deception. 


The way in which people move, talk, and gesture—their fa- 
cial expressions, posture, and speech—all contribute to the for- 
mation of impressions about them. Many of the judgments we 
make about others in our everyday lives are based on cues from 
these expressive behaviors. Gordon Allport (1937) believed that 
expressive behaviors were important indicators of personality 
and that impressions from brief interactions were often veri- 
fied upon further acquaintance. Allport and Vernon (1933) 
demonstrated that people’s expressive styles were quite consis- 
tent across a variety of situations. They then began to investi- 
gate the accuracy of perceivers’ impressions that were based on 
observations of these expressive styles. For reasons that have 
been discussed elsewhere, this issue—like other issues con- 
cerning the accuracy of interpersonal and social perception— 
was neglected for a long time (Funder, 1987; Kenny & Albright, 
1987). Recently, however, there has been a resurgence of interest 
in the accuracy of social and interpersonal perception and 
judgment (Funder, 1987; Kenny & Albright, 1987; Kruglanski, 
1989; Swann, 1984) and in the study of expressive behavior 
(Lippa, 1983; Riggio & Friedman, 1986). 

Recent work confirms earlier findings (Passini & Norman, 
1966) that the ratings of strangers converge surprisingly well 
with self-ratings of personality by targets (Albright, Kenny, & 
Malloy, 1988; Funder & Colvin, 1988; Watson, 1989). This 
correspondence seems to confirm Gordon Allport’s observa- 
tion that there is something in the nature of individuals that 
leads observers to attribute certain characteristics to them 
(Allport, 1937). We believe this “something” is communicated 
through expressive behavior. Much of this expressive behavior 
is unintended, unconscious, and yet extremely effective. For 
example, we communicate our interpersonal expectancies and 
biases through very subtle, almost imperceptible, nonverbal 
cues. These cues are so subtle that they are neither encoded nor 
decoded at an intentional, conscious level of awareness (Chai- 
kin, Sigler, & Derlega, 1974; Christensen & Rosenthal, 1982; 
Harris & Rosenthal, 1985; Rosenthal, 1966; Rosenthal & Ru- 
bin, 1978; Snyder, Tanke, & Berscheid, 1977; Word, Zanna, & 
Cooper, 1974). 

The remarkable aspect of this expressive behavior is its com- 
municative power. A great deal of information is communi- 
cated even in fleeting glimpses of expressive behavior. Erving 
Goffman (1979) wrote about the “glimpsed” (p. 22) world, con- 
sisting of glimpses of strangers—a world bereft of details, yet 
quite rich in social information. He suggested that a rough 
correspondence exists between the characteristics of the people 
glimpsed in this world and impressions of them. Goffman used 
the ethological concept of “displays,” or behaviors that signal 
inter- and intraspecies information rapidly and efficiently, to 
explain the accuracy of impressions that are based on glimpsed 
behavior. He suggested that in humans, expressive behaviors 
constitute an important aspect of displays and, like displays in 
other species, these behaviors are processed naturally and effi- 
ciently by their targets. 

Goffman’s observations have been confirmed by subsequent 
research: Judgments about others can be quite accurate even 
when they are based on brief observations of expressive behav- 
ior (Albright et al., 1988; Funder & Colvin, 1988; Watson, 1989). 
One study reported that strangers’ ratings based on 5-min videotaped 
clips of targets correlated to a surprising 
extent with the targets’ self-ratings (Funder & Colvin, 1988). 
Others have also suggested that the unarticulated yet strong 
affective reactions, or “feelings,” that arise from fleeting 
glimpses of the behavior of others might have some basis in 
reality (Schneider, Hastorf, & Ellsworth, 1979; Zajonc, 1980, 
1984). This suggestion is confirmed by findings indicating that 
people are fairly accurate at identifying emotions from expo- 
sures to nonverbal behavior lasting only 375 ms (Rosenthal, 
Hall, DiMatteo, Rogers, & Archer, 1979). Other research in 
social and personality psychology also suggests that people 
might be unexpectedly accurate in the judgments they make on 
the basis of minimal information and minimal amounts of cog- 
nitive processing. For example, sometimes people make better 
judgments when they rely on their intuition rather than when 
they introspect or reason (Wilson & Schooler, 1991). 

One way to examine the contribution of expressive behavior 
to the accuracy of person perception would be to rigorously 
control the nature of the information provided to observers and 
then examine the accuracy of their impressions in relation to 
various external criteria. This has been done by exposing ob- 
servers to carefully selected clips of video- and audiotaped be- 
havior and assessing how well their judgments predict some 
external criterion. A systematic review and examination of 
these studies would be useful in shedding light on the type of 
information communicated by such observations of behavior. 
To assess whether judgments that are based on glimpsed obser- 
vations are accurate, only studies in which such observers were 
exposed to fairly brief segments of behavior should be consid- 
ered. If fairly accurate judgments can be made solely on the 
basis of targets’ nonverbal behavior, this result would indicate 
the importance of such behavior in accurate impression forma- 
tion. Furthermore, the accuracy of judgments that are based on 
expressive behavior alone could be compared with the results of 
some classic studies that have examined the relationship be- 
tween various types of assessment and evaluation measures and 
the prediction of certain outcomes. For example, in the Men- 
ninger study of psychiatric residents, a variety of psychological 
assessment procedures were used to predict professional com- 
petence and success, as defined by supervisor ratings and peer 
ratings (Holt & Luborsky, 1958). Can similar predictions be 
made from judgments that are based exclusively on expressive 
behavior? 

The purpose of this article is to conduct a meta-analysis of 
studies on the accuracy of predictions from brief observations, 
or what we call “thin slices” of expressive behaviors. The results 
of this meta-analysis should have important theoretical and 
practical implications. Evidence that judgments about people 
on the basis of brief exposures to them are accurate would have 
enormous implications for the study of interpersonal percep- 
tion. First, this evidence would suggest that intuitive natural 
judgments and perceptions of others are more accurate than 
one would expect. These findings would add to the body of 
research regarding the accuracy of day-to-day decisions 
(Funder, 1987; Kruglanski, 1989). Second, such evidence would 
also indicate that the behavior of individuals is predictable 
within certain situations. A third implication of these findings 
would be that people rapidly and unwittingly communicate a 
great deal of information regarding themselves to others. 
Fourth, this evidence would have important practical implica- 
tions for various assessment, evaluation, and training proce- 
dures in clinical and other applied areas of psychology. Finally, 
such evidence would have important methodological implica- 
tions for conducting research on expressive behavior. 


Evaluation of Accuracy 


Before considering the extent of the accuracy of predictions, 
it is first necessary to consider how accuracy can be evaluated. 
The notion of accuracy implies a correspondence between a 
judgment and a criterion (Brown, 1965; Kruglanski, 1989). The 
selection of appropriate criteria to evaluate judgmental accu- 
racy is problematic. Objective, externally valid criteria against 
which to evaluate predictions in the areas of social and clinical 
psychology are difficult to find because most of the criteria in 
these areas themselves often involve judgments. Addressing 
this issue, Kenny and Albright (1987) identified a number of 
criterion measures used in social perception. These measures 
include self-report measures, third-person (expert) judgments, 
objective measurements such as physiological variables, judge 
ratings, and operational criteria such as those used in lie-detec- 
tion studies in which subjects are instructed to lie. In this meta- 
analysis, we considered only those studies that had (a) experi- 
mentally or objectively defined, clear behavioral criteria, corre- 
sponding to Kenny and Albright’s operational criteria or (b) 
criteria that were ecologically valid and commonly used in ev- 
eryday decisions about people, roughly corresponding to 
Kenny and Albright’s criterion of expert judgments. We excluded 
self-report measures as criterion variables because they seem to 
be influenced by factors such as social awareness and self- 
knowledge of the target individual (Cheek, 1982). An example 
of an experimentally defined criterion would be whether a sub- 
ject was actually lying in a deception-detection experiment. An 
ecologically valid criterion would be the use of supervisors’ 
ratings to evaluate therapeutic effectiveness, because it is one of 
the primary methods of making decisions about performance 
and promotion. For the same reason, ratings by students would 
be a satisfactory criterion to evaluate college teacher effective- 
ness. 

In the typical paradigm for research using such criteria, 
raters are asked to rate short samples of targets’ behavior on 
various affective or personality dimensions. If the ratings are 
satisfactorily reliable, they can be used to postdict, predict, or 
paridict the criterion variable. For example, judges might hear 
short samples of therapists talking to their patients, and they 
might be asked to rate the therapists on a series of dimensions 
such as anxiety, competence, and warmth. If these ratings are 
reliable, they will then be correlated with the criterion variable, 
which might be the patient’s prognosis by someone other than 
the therapist—such as a supervisor, or a team of caretakers, or 
the supervisors’ ratings of the therapist. Accuracy, in this con- 
text, refers to the correspondence between the consensual judg- 
ments of the group of judges and the criterion variable. 
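
The paradigm described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings, the criterion values, and the function names are hypothetical and do not come from any study in this review. The sketch averages the judges' ratings for each target into a consensual judgment and then correlates those consensual judgments with the criterion variable.

```python
from math import sqrt
from statistics import mean


def consensus_ratings(ratings_by_judge):
    """Average the judges' ratings for each target (one inner list per judge)."""
    return [mean(target_ratings) for target_ratings in zip(*ratings_by_judge)]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: three judges rate five therapists on warmth (1-7 scale).
ratings_by_judge = [
    [3, 5, 2, 6, 4],
    [4, 6, 1, 7, 4],
    [2, 5, 3, 6, 5],
]
# Hypothetical criterion: supervisors' ratings of the same five therapists.
criterion = [3.0, 5.5, 2.5, 6.0, 4.0]

consensus = consensus_ratings(ratings_by_judge)
accuracy_r = pearson_r(consensus, criterion)
print(round(accuracy_r, 2))
```

In this sketch, accuracy is simply the correlation between the averaged ratings and the criterion; checking the reliability of the ratings, which the text treats as a precondition, is not shown.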


Accuracy From Thin Slices of Behavior: Some Findings 


A number of recent studies have indicated that ratings of 
brief observations or thin slices of behavior can be used to 
predict various social and clinical psychological outcomes at 
levels significantly above those expected by chance (e.g., Babad, 
Bernieri, & Rosenthal, 1989b, 1989c; O’Sullivan, Ekman, & 
Friesen, 1988). We review some of the research below. This 
review is intended to be illustrative rather than exhaustive. 


Clinical Outcomes 


In the clinical literature, thin slices of behavior have been 
used to study aspects of the therapeutic relationship. This 
method was advocated by Carl Rogers and his associates, who 
found that important variables in the Rogerian therapeutic re- 
lationship such as warmth, accurate empathy, and rapport 
could be satisfactorily assessed from two or three 2- to 5-min segment 
observations of behavior. Ratings of variables from these 
short segments could then be used to predict patient outcomes 
at levels above chance (Burstein & Carkhuff, 1968; Carkhuff & 
Berenson, 1967; Rogers, Gendlin, Kiesler, & Truax, 1967; 
Truax, 1966; Truax & Carkhuff, 1967; Truax, Wittmer, & 
Wargo, 1971), and the reliability of ratings from short segments 
did not differ significantly from ratings of longer sessions 
(Mintz & Luborsky, 1971). 

More recently, subtle expectations and biases of therapists 
were identified by ratings of their tone of voice while talking to 
and about their patients. The same ratings served to distinguish 
therapists with high and low ratings from their supervisors 
(Blanck, Rosenthal, & Vannicelli, 1986; Blanck, Rosenthal, 
Vannicelli, & Lee, 1986). Ratings of nonverbal behavior from 
brief clips have also been used to distinguish anxious and de- 
pressed people from normal people (Waxer, 1974, 1976, 1977). 


Social Psychological Outcomes 


One area in social psychology in which thin slices of behavior 
have been used frequently to assess behavior with considerable 
accuracy is the area of interpersonal expectancies and biases. 
Brief clips of behavior have been used to identify successfully 
the subtle expressive cues conveying interpersonal expectancies 
that are very influential in the interpersonal influence process 
(Chaikin et al., 1974; Duncan & Rosenthal, 1968; Harris & Ro- 
senthal, 1985; Rosenthal, 1966, 1969; Rosenthal & Rubin, 
1978). For example, a series of studies conducted by Bugental 
and her colleagues revealed that parents’ expectancies, identi- 
fied from brief clips of their tone of voice, are related to their 
children’s behavior (Bugental, Caporael, & Shennum, 1980; Bu- 
gental, Henker, & Whalen, 1976; Bugental & Love, 1975; Bu- 
gental, Love, Kaswan, & April, 1971). Thus ratings of the tone 
of voice of mothers of normal children and of children with behavior 
problems in school differed significantly, with the latter 
mothers’ tone of voice revealing a lack of confidence in their ability 
to control their children (Bugental & Love, 1975). 
Research in the classroom has shown that judges can distin- 
guish biased from unbiased teachers and also can identify dif- 
ferential teacher expectancies and affect toward students from 
very brief clips of teachers’ behavior (Babad, Bernieri, & Ro- 
senthal, 1987, 1989b, 1989c). Research in the courtroom has 
shown that from brief excerpts of judges’ instructions to jurors 
in actual criminal trials, raters could postdict the judges’ expec- 
tations for the trial outcome and the criminal history of the 
defendant (Blanck, Rosenthal, & Cordell, 1985). 

Ratings from thin slices have also been used to make accu- 
rate predictions regarding social outcomes pertaining to the 
communication of affect. For example, one set of studies investigated 
the communication of affect by network television newscasters. 
In the first study, 2.5-s-long clips of the facial expressions 
of newscasters during the 1976 presidential election campaign 
revealed significant differences in the facial expressions 
of the newscasters as a function of the candidate they were 
talking about (Friedman, DiMatteo, & Mertz, 1980). The sec- 
ond study extended this finding by relating it to an outcome 
measure. Ratings of 2.5-s clips of network newscasters’ facial 
expressions during the 1984 presidential elections showed that 
one newscaster had significantly more positive facial expres- 
sion when talking about one of the candidates. Voters who regu- 
larly watched this newscaster were significantly more likely to 
vote for the candidate he favored (Mullen et al., 1986). 

Research on the accuracy of the detection of deception has 
relied almost exclusively on ratings of thin slices of behavior. A 
commonly used experimental paradigm involves subjects being 
audiotaped or videotaped while honestly describing someone 
they like, honestly describing someone they dislike, and dis- 
honestly pretending to like the disliked target and dislike the 
liked target. From short clips of their behavior in each of these 
conditions, judges are asked to rate the honesty of the subject 
(e.g., DePaulo & Rosenthal, 1979). Reviews of research on the 
accuracy of deception detection have found that the mean accu- 
racy of detection was above chance (DePaulo, Zuckerman, & 
Rosenthal, 1980; Zuckerman, DePaulo, & Rosenthal, 1981), al- 
though it appears that it is influenced by a number of factors in 
the experimental situation (Bond, Kahler, & Paolicelli, 1985; 
DePaulo, Kirkendol, Tang, & O’Brien, 1988). 

How well do ratings from thin slices of behavior predict clini- 
cal and social outcomes compared with predictions that are 
based on other methods? This question is addressed later in this 
article. Although short segments of behavior apparently can be 
used to predict socially and clinically important outcomes suc- 
cessfully, there has been some controversy about whether rat- 
ings of specific behavioral channels (such as speech, posture, or 
facial expression) are linked to more accurate assessment 
(Archer & Akert, 1977; Argyle, Alkema, & Gilmour, 1971; 
Brown, 1986, pp. 496-502; Mehrabian & Wiener, 1967). We 
review some of the pertinent research in the next section. 


Accuracy of Predictions From Different Channels 
of Communication 


Studies on the accuracy of predictions from short segments 
of behavior have varied the behavioral channels shown to 
judges. The typical channels shown include (a) the nonverbal 
channels, which include the visual channels (face, body, or face 
and body) and the vocal channel (just tone of voice), (b) the 
verbal channels, which include speech and transcripts, and (c) 
the audiovisual channel (combining the visual and verbal chan- 
nels). In comparing these channels, most of the research has 
addressed two related issues. The first concerns the relative 
efficacy of the verbal versus the nonverbal channels. The sec- 
ond concerns the relative efficacy of the various nonverbal 
channels. 

Most of the work reviewed in this section has been drawn 
from the social psychological literature, especially work on the 
detection of deception. Studies in the clinical literature have 
indicated that ratings do not seem to differ significantly across 
channels (Burns & Beier, 1973; English & Jelenevsky, 1971; 
Strahan & Zytowski, 1976; Strong, Taylor, Bratton, & Loper, 
1971). 

Goffman (1959, 1971) pointed out that expressive behaviors 
can be controlled to meet certain self-presentation goals and to 
convey certain impressions. Although we are generally able to 
monitor and control aspects of our behavior according to the 
“display rules” that determine what behaviors are culturally and 
socially appropriate, sometimes our true feelings “leak” out 
through the behavioral channels that are less controllable (Ek- 
man & Friesen, 1969). At some level, perceivers realize that 
people can choose their words carefully, but they are less adept 
at controlling their facial, vocal, and bodily expressions. The 
lack of control of nonverbal behavior could be attributed to a 
lack of awareness of these behaviors, because people cannot see 
or hear themselves as others do. 

The most controllable channel is the verbal channel (speech), 
followed by the face, the body, and the least controllable chan- 
nel, the voice (Brown, 1986; Rosenthal & DePaulo, 1979). The 
leakier, less controllable, nonverbal channels should be more 
accurate. When verbal and nonverbal behaviors are inconsis- 
tent, nonverbal behaviors may be more revealing of the true 
message. For example, Blanck et al. (1985) found that judges 
who expected a defendant to be guilty revealed their expecta- 
tion in their nonverbal behavior, but not in their verbal behav- 
ior. Similarly, distressed married couples trying to act happy 
could be distinguished from happy couples by their nonverbal 
behavior rather than their verbal behavior (Vincent, Friedman, 
Nugent, & Messerly, 1979). Much of the research comparing 
channels of communication, however, has focused on the accuracy 
of identifying deceptive or inconsistent messages. In a theoretical 
review of the literature, Noller (1985) found that access 
to the leaky channels becomes more important for accuracy for 
deceptive or inconsistent behaviors. But meta-analytic results 
clearly indicate that the presence of verbal content improves 
the accuracy of detecting deception (DePaulo et al., 1980; Zuck- 
erman, DePaulo, & Rosenthal, 1981). Although the issue of the 
relative importance of the verbal and the nonverbal channels 
has not been resolved, the relative importance of the channel 
apparently depends on a number of different factors. These 
include (a) the expectations of the judges (Zuckerman, Spiegel, 
DePaulo, & Rosenthal, 1982), (b) the type of message being 
conveyed (Apple, Streeter, & Krauss, 1979; Ekman, 1988; Ek- 
man, Friesen, O’Sullivan, & Scherer, 1980; Streeter, Krauss, 
Geller, Olson, & Apple, 1977; Zuckerman, Amidon, Bishop, & 
Pomerantz, 1982; Zuckerman, Larrance, Spiegel, & Klorman, 
1981), (c) situational factors such as familiarity with the situation 
regarding which judgments have to be made (Krauss, Apple, 
Morency, Wenzel, & Winton, 1981; Stiff et al., 1989; Zuckerman, 
Spiegel, DePaulo, & Rosenthal, 1982), (d) the type of 
affect being expressed and the type of affect being rated (DePaulo 
& Rosenthal, 1979; DePaulo, Rosenthal, Eisenstat, 
Rogers, & Finkelstein, 1978; Noller, 1985; Scherer, Scherer, 
Hall, & Rosenthal, 1977; Zuckerman, Hall, DeFrank, & Rosenthal, 
1976; Zuckerman, Spiegel, DePaulo, & Rosenthal, 
1982), (e) the quality of the information being transmitted by 
the various channels (Argyle et al., 1971; Gallois & Callan, 
1986), and (f) the motivation of the subjects (DePaulo et al., 
1988; DePaulo, Lanier, & Davis, 1983). 

What implications do these findings regarding the contributions 
of the various behavioral channels have for this article? We 
expect that the accuracy of predictions from brief observations 
of behavior will not be significantly different for the different 
channels. If anything, information from observations including 
verbal information should be superior. In summary, the major 
questions addressed by this meta-analysis are (a) can accurate 
predictions be made from short observations of expressive behavior, 
(b) if they can be made, how accurate are these predictions, 
and (c) are certain behavioral channels associated with 
more accurate predictions from thin slices of behavior? 


Method 
Literature Search 


Four methods were used to locate the relevant studies. First, an 
initial computer search of Psychological Abstracts was conducted to 
retrieve documents containing the terms accuracy, deception, and non- 
verbal behavior. However, because our main criterion for inclusion of 
studies in this analysis was a methodological one, this method did not 
prove to be very useful. The second method was a manual search of 
volumes of the following journals covering a time span ranging from 
1970 to 1990. Journals were selected on the basis of relevance to social 
and clinical psychology and citation in other relevant articles and 
books. Journals manually searched for articles included the European 
Journal of Social Psychology, Journal of Abnormal Psychology, Journal 
of Applied Social Psychology, Journal of Clinical Psychology, Journal of 
Communication, Journal of Consulting and Clinical Psychology, Journal 
of Counseling Psychology, Journal of Educational Psychology, Journal of 
Experimental Social Psychology, Journal of Nonverbal Behavior, Jour- 
nal of Personality and Social Psychology, and Personality and Social 
Psychology Bulletin. Third, publications (especially those published 
before 1970) were located by searching reference lists of relevant arti- 
cles and books. Last, our own files provided preprints and unpub- 
lished manuscripts pertinent to this investigation. 

The last three methods yielded nearly 100 studies. To be included in 
this review, a study had to meet the following criteria: 

1. Ratings or judgments should have been based on no more than 
300 s of behavioral observations. Every observation of the same subject 
was included in estimating the observation length. For example, if 
sixty 10-s clips of the same subject were rated in a study, judges would 
have observed the subject for 600 s, and the study would have been 
excluded from this analysis. If no information was provided on the 
length of observations and if length could not be assessed from other 
information provided, the study was not included.¹ We chose a cutoff 
of 5 min rather arbitrarily, in appreciation of the work by Pittenger, 
Hockett, and Danehy (1960), who elegantly described the richness of 
information that can be communicated in just 5 min in the context of a 
diagnostic therapeutic interview. Although subsequent work has 
shown that a great deal of information can be conveyed in even briefer 
time periods (e.g., DePaulo & Rosenthal, 1979; Milmoe, Novey, Kagan, 
& Rosenthal, 1968; Milmoe, Rosenthal, Blane, Chafetz, & Wolf, 1967; 
Rosenthal, Blanck, & Vannicelli, 1984), we selected 5 min (300 s) as our 
cutoff for this review. 

1 Some studies did not provide the observation length, but it was 
possible to estimate it from other information provided in the study. 
For instance, one study gave examples of four scenes typical for length 
and message type (Bugental, Love, Kaswan, & April, 1971). We had 3 
people read these messages at a normal rate of speaking and averaged 
their timed responses to get an estimate of observation length. 

2. Short behavioral ratings had to be related to some clearly defined 
external, objective, behavioral criterion or to the criterion of ratings by 
experts. An external criterion would be the existence of deception. An 
expert criterion would be supervisor’s ratings of effectiveness. 
Self-report ratings were not used as criterion variables, nor were peer 
ratings. 

3. The study had to contain enough information to permit estimation 
of the significance level and effect size.² Forty-four studies met all 
the criteria for inclusion. These studies certainly represent a large sample 
of the population of studies that met all our criteria, although it is 
ple of the population of studies that met all our criteria, although it is 
always possible that some studies might have been overlooked. The 
potential impact of studies we might have missed will be considered in 
the Results section. While coding all the studies, it became clear that a 
few studies were correlated (because they had used the same subjects). 
Results from these were combined, and the final sample consisted of 
38 independent results drawn from 44 studies. Table 1 contains infor- 
mation about the authors, year of publication, criterion variable, and 
total length of observation for each study. 
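
The observation-length rule in Criterion 1 can be expressed as a simple filter. The following is a minimal sketch, not the authors' actual coding procedure; the function names and the way a study is represented (a clip count and a clip length in seconds) are assumptions made for illustration. It encodes the stated rules: total observation time per subject must be no more than 300 s, and a study whose length cannot be determined is excluded.

```python
MAX_TOTAL_SECONDS = 300  # the 5-min cutoff stated in Criterion 1


def total_observation_seconds(n_clips, clip_seconds):
    """Total time judges observed one subject, summed across all clips."""
    return n_clips * clip_seconds


def meets_length_criterion(n_clips, clip_seconds):
    """Return True if the study satisfies the observation-length criterion.

    A study with unknown clip count or clip length (None) is excluded,
    mirroring the rule for studies whose length could not be assessed.
    """
    if n_clips is None or clip_seconds is None:
        return False
    return total_observation_seconds(n_clips, clip_seconds) <= MAX_TOTAL_SECONDS


# The example from the text: sixty 10-s clips of the same subject.
print(total_observation_seconds(60, 10), meets_length_criterion(60, 10))
```

The example from the text (sixty 10-s clips, 600 s in total) fails the check, whereas a hypothetical study with thirty 10-s clips (300 s) would pass it.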


Coding Procedure 


The following variables regarding subjects and raters or judges were 
coded for each study: (a) number of subjects overall (stimulus persons 
or ratees), (b) proportion of female subjects, (c) whether the subjects 
were college students, non-college students, or children, (d) whether 
behavior of the subjects was naturally occurring or experimentally ma- 
nipulated, (e) the number of raters (or judges), and (f) the proportion of 
female raters. 

General information about the study recorded included (a) whether 
it was a field or a laboratory study (this category overlapped consider- 
ably but not completely with the previously mentioned category re- 
garding naturally occurring or manipulated behavior), (b) whether the 
outcome variables could be defined as related to clinical psychology, 
social psychology or the psychology of deception (although studies on 
deception fall under the general rubric of social psychology, they were 
considered a separate category because of previous work on deception 
accuracy as a separate category), (c) the year of publication of the study, 
and (d) the publication outlet. 

Specific information coded for each study included (a) the number of 
clips rated for each subject, (b) the number of behavioral clips rated by 
each judge, (c) the length of each clip, (d) the length of the total observation 
time for each subject, (e) whether nonverbal behavior alone or 
whether both verbal and nonverbal behaviors were rated, and (f) the 
behavioral channels rated (face, body, speech, tone of voice, tran- 
scripts, or combinations of the above). 

Information about the results also included the magnitude (the ef- 
fect size) and the significance level of the relationship between the 
criterion variable and the behavioral dimensions rated. For each study, 
only one effect size and one level of significance were recorded to meet 
the requirement of independence of effect sizes and significance levels. 
When there were multiple results that seemed to be correlated in a 
single study, and the correlation between these multiple dependent 
variables could be estimated, Rosenthal and Rubin's (1986) formula 
was used to compute effect sizes and significance levels.³ When the 
intercorrelation could not be estimated, the mean of the relevant results 
was used in the meta-analysis. This is a robust procedure, although 
it is conservative, and probably deflated the effect sizes (Rosenthal, 
1984; Rosenthal & Rubin, 1986).⁴ Also coded were (a) the direction 
of the effect and (b) the effect sizes and significance levels 
separately for each behavioral channel, if they were reported separately 
in the studies. The meta-analytic procedures used to analyze these 
data are those described in Rosenthal (1984).⁵ The effect size (r) and 
level of significance (Z) for each study are also presented in Table 1. 
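
The averaging steps mentioned in the footnotes to this section (transforming each r to Fisher's z_r before averaging, computing t from r and df, and finding the standard normal deviate Z for a p value) can be sketched as follows. This is a hedged illustration using only the Python standard library. The function names are hypothetical, and treating the p value as one-tailed is an assumption, because the text does not state which tail convention was used. The Rosenthal and Rubin (1986) formula for correlated results is not reproduced here.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist, mean


def mean_r(rs):
    """Average correlations via Fisher's z_r, then convert back to r."""
    return tanh(mean([atanh(r) for r in rs]))


def t_from_r(r, df):
    """t = r * [df / (1 - r^2)]^(1/2), the formula given in the footnote."""
    return r * sqrt(df / (1 - r ** 2))


def z_from_p(p_one_tailed):
    """Standard normal deviate for a one-tailed p value (assumed convention)."""
    return NormalDist().inv_cdf(1 - p_one_tailed)


# Hypothetical values for illustration only.
print(round(mean_r([0.2, 0.6]), 3))
print(round(t_from_r(0.5, 30), 3))
print(round(z_from_p(0.05), 3))
```

Averaging in the z_r metric rather than averaging the raw r values gives slightly more weight to larger correlations, which is why the back-transformed mean of .20 and .60 exceeds .40.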


Study Characteristics 


Of the 44 studies, 68.2% (30) were reported between 1980 and 1990, 
27.3% (12) between 1970 and 1979, and 4.5% (2) between 1960 and 
1969. The median year of publication was 1983. Forty-one studies appeared 
in journals, 1 was reported in a book chapter, and 2 were unpublished 
manuscripts. The median number of subjects was 32 and the 
range was from 2 to 271. Of the 38 results analyzed, 6 studies used only 
female subjects, 8 used only male subjects, 9 used equal numbers of 
male and female subjects, 2 used between 51% and 99% female sub- 
jects, 2 between 51% and 99% male subjects, and 11 did not report 
subject gender. The median number of judges per study was 37, and the 
number ranged from 2 to 446. All the judges were naive subjects. Five 
studies reported using only female judges, 1 study used only male 
judges, 13 studies used equal numbers of male and female judges, 3 
studies used 51-99% female judges, 3 studies used 51-99% male 
judges, and 13 did not report genders. 


Results 


Table 2 contains a stem and leaf display of the effect sizes of 
the studies in the meta-analysis. Table 3 contains some additional 
useful information about central tendency, variability, 
significance tests, and confidence intervals. 


Effect Size and Significance Testing 


All the results show a positive effect size for accuracy in pre- 
dictions, where 50% would be expected under the null hypothe- 
sis. The mean effect size for accuracy of judgment from all 
segments of behavior under 300 s was .39. This effect was asso- 


2 In a critique of conclusions on the accuracy of deception detection
(e.g., DePaulo, Zuckerman, & Rosenthal, 1980), Kraut (1980) argued
that hit rates should be considered in addition to effect sizes in estimat- 
ing accuracy. If looked at in this way, accuracy scores for the detection 
of deception rarely exceed 65%, which is regarded as a medium effect 
size by Cohen (1988), given a chance level of 50%. Although we agree 
that providing several effect-size estimates can be useful in some cir- 
cumstances, our analysis was confined to the more common effect 
sizes, largely because most of the studies reviewed provided no infor- 
mation regarding hit rates. 

3 For example, in one study, therapists’ tone of voice in talking about 
patients was used to predict how they would talk to those patients. 
Judges rated the latter dependent variable on dimensions of warmth, 
dominance, empathy, competence, hostility, anxiety, optimism, profes- 
sionalism, honesty, and liking (Rosenthal, Blanck, & Vannicelli, 1984). 
The typical correlation among these variables is .50; this was the value 
entered into Rosenthal and Rubin’s (1986) equation. The purpose of 
this equation is to get a more accurate estimate of effect sizes and 
significance levels, because traditional methods such as computing the 
mean or median are too conservative (see Rosenthal & Rubin, 1986, for 
more information). 

4 To estimate the mean effect size, each r value was transformed to Zr
before computing the mean, which was then converted into the r for
this study. Estimation of the mean significance level was done in one of
two ways: (a) When the r was reported, the t for each result was computed
from the r using the formula $t = r\,[df/(1 - r^2)]^{1/2}$; the corresponding
Z (standard normal deviate) associated with the p value of each t
was found and these Zs were averaged, and the p level corresponding
to the average Z was the significance level of the study, and (b) when the
r was not reported, the Z was found for the p value reported in the
result, and r was obtained from the formula $r = Z/N^{1/2}$. If results were
reported as nonsignificant and no p value was specified, the p value
was assumed to be .50 with a Z of 0 (Rosenthal, 1984).
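
The transformations described in this footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the original computations: the function names are hypothetical, and `z_from_one_tailed_p` assumes that the reported p values are one-tailed.

```python
import math
from statistics import NormalDist, mean


def mean_r_via_fisher_z(r_values):
    """Average correlations by converting each r to Fisher's Zr and back."""
    zr_values = [math.atanh(r) for r in r_values]  # r -> Zr
    return math.tanh(mean(zr_values))              # mean Zr -> r


def t_from_r(r, df):
    """t = r * [df / (1 - r^2)]^(1/2), used when r was reported."""
    return r * math.sqrt(df / (1.0 - r * r))


def r_from_z(z, n):
    """r = Z / N^(1/2), used when only a p value (hence a Z) was reported."""
    return z / math.sqrt(n)


def z_from_one_tailed_p(p):
    """Standard normal deviate for a one-tailed p (assumption: one-tailed)."""
    return NormalDist().inv_cdf(1.0 - p)


if __name__ == "__main__":
    print(mean_r_via_fisher_z([0.2, 0.6]))
    print(t_from_r(0.5, 30))
    print(r_from_z(2.0, 100))        # 0.2
    print(z_from_one_tailed_p(0.5))  # a result reported only as nonsignificant
```

Averaging in the Zr metric rather than on raw r values matters most when some correlations are large, because the sampling distribution of r becomes increasingly skewed as r moves away from zero.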

5 We used a program written by Monica Harris (1984) for the meta- 
analytic computations. 
Table 1
Summary Table of Studies, Criterion Variables, and Length of Exposure

| Authors | Area | Criterion variable | Length (in seconds) | Zr | Z |
|---|---|---|---|---|---|
| Ambady and Rosenthal (1990) | S | Teacher effectiveness | 30 | 0.95 | 3.62 |
| Apple and Hecht (1982) | S | Actual affect of subjects | 96 | 0.23 | 3.48 |
| Archer and Akert (1977) | S | Actual social behavior of subjects | 30-60 | 0.83 | 12.18 |
| Babad, Bernieri, and Rosenthal (1989b) | S | Existence of bias in teachers | 270 | 0.60 | 3.16 |
| Babad, Bernieri, and Rosenthal (1989a) | S | Expectancies of teachers | 120 | 0.42 | 5.24 |
| Babad, Bernieri, and Rosenthal (1987) | S | Status of teachers | 270 | 0.55 | 3.12 |
| Blanck, Rosenthal, and Vannicelli (1986)_a | C | Status of patients | 50-100 | 0.15 | 3.32 |
| Blanck, Rosenthal, Vannicelli, and Lee (1986)_a | C | Supervisor ratings of therapists | 50-100 | | |
| Rosenthal, Blanck, and Vannicelli (1984)_a | C | Way therapist would speak to patients | 30-60 | | |
| Bugental, Love, Kaswan, and April (1971) | C | Types of families | 3.5 | 0.16 | 0.85 |
| DePaulo and Rosenthal (1979)_b | D | Existence of deception and type of affect | 240 | 0.28 | 4.26 |
| DePaulo, Rosenthal, Green, and Rosenkrantz (1982)_b | D | Existence of deceptive and mixed messages | 120 | | |
| DePaulo, Lanier, and Davis (1983) | D | Existence of deception | 120 | 0.59 | 4.37 |
| DePaulo, Lassiter, and Stone (1982) | D | Existence of deception | 120 | 0.16 | 1.82 |
| DePaulo, Stone, and Lassiter (1985) | D | Existence of deception | 300 | 0.27 | 1.95 |
| Feldman (1976) | D | Existence of deception and honesty in teachers | 30 | 0.33 | 2.00 |
| Feldman, Jenkins, and Popoola (1979) | D | Existence of deception | 20-40 | 0.44 | 2.29 |
| Feldman and Prohaska (1979) | S | Adequacy of teachers | 20 | 0.37 | 2.21 |
| Friedman, Hall, and Harris (1985) | C | Score on index of peripheral artery disease | 35 | 0.27 | 3.20 |
| Fugita, Hogrebe, and Wexley (1980) | D | Existence of deception | 240 | 0.21 | 3.75 |
| Hall and Braunwald (1981) | S | Detecting gender of target being addressed | 40 | 0.42 | 3.08 |
| Hall, Roter, and Rand (1981) | C | Predicting patient commitment & compliance | 90 | 0.30 | 2.94 |
| Hall, Roter, and Katz (1987)_c | C | Physician proficiency and patient satisfaction | 90 | 0.22 | 3.72 |
| Roter, Hall, and Katz (1987)_c | | | | | |
| Kaul and Schmidt (1971) | C | Trustworthiness in interviewers | 300 | 0.47 | 2.32 |
| Machida (1986) | S | Comprehension in children | 120 | 0.93 | 4.75 |
| Manstead, Wagner, and MacDonald (1986) | D | Existence of deception | 60 | 0.32 | 1.82 |
| Milmoe, Rosenthal, Blane, Chafetz, & Wolf (1967) | C | Referral of alcoholic patients | 90 | 0.42 | 1.64 |
Table 1 (continued)

| Authors | Area | Criterion variable | Length (in seconds) | Zr | Z |
|---|---|---|---|---|---|
| Milmoe, Novey, Kagan, & Rosenthal (1968) | C | Behavior of babies | 51 | 0.42 | 1.70 |
| Mullen et al. (1986) | S | Voting behavior | 25 | 0.23 | 2.55 |
| O’Sullivan, Ekman, and Friesen (1988) | D | Existence of deception | 120 | 0.14 | 0.78 |
| Riggio and Friedman (1983) | D | Existence of deception | 78 | 0.29 | 3.80 |
| Riggio, Tucker, and Throckmorton (1987) | D | Existence of deception | 120 | 0.11 | 1.14 |
| Steckler and Rosenthal (1985) | S | Communication with boss, peer, or subordinate | 30 | 0.11 | 0.91 |
| Stiff et al. (1989) | D | Existence of deception | 180 | 0.25 | 5.12 |
| Streeter, Krauss, Geller, Olson, and Apple (1977) | D | Existence of deception | 26 | 0.21 | 2.42 |
| Waxer (1977) | C | High- versus low-anxiety patients | 60 | 0.58 | 3.25 |
| Waxer (1976) | C | Depression in patients | 120 | 0.58 | 4.08 |
| Waxer (1974) | C | Depression in patients | 120 | 1.33 | 9.59 |
| Zuckerman, DeFrank, Hall, Larrance, and Rosenthal (1979) | D | Existence of deception | 120 | 0.60 | 8.17 |
| Zuckerman, Koestner, Colella, and Alton (1984)_d | D | Existence of deception | 200 | 0.74 | 13.26 |
| Zuckerman, Koestner, and Alton (1984)_d* | D | Existence of deception | 200 | | |
| Zuckerman, Fisher, Osmun, Winkler, and Wolfson (1987)_d | D | Existence of deception | 200 | | |
| Zuckerman, Driver, and Guadagno (1985) | D | Existence of deception | 100 | 0.26 | 2.35 |

Note. Length of the clips was calculated by combining the length of clips for each subject. Thus, if judges 
rated three 10-s clips of the same subject, length would be recorded as 30 s. Studies having the same 
subscript were combined in this analysis. C = studies with clinical criteria, D = studies on accuracy of 
deception, S = studies with social psychological criteria. 

* Results from the control condition alone were included in this analysis, because the experimental 
conditions were training conditions to increase the accuracy of detecting deception. 


ciated with a statistically significant Z of 22.56, p<.1'!*. When 
weighted by the degrees of freedom, the mean r was .41. This 
significant and substantial effect size (Cohen, 1988) indicates 
that people can predict outcomes quite accurately from very 
small segments of behavior. The 95% confidence interval sug- 
gests the likely range of effect sizes to be from .34 to .48. 
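
Table 3 labels the overall significance test a combined Stouffer Z. As a minimal sketch of that kind of combination (the function name is hypothetical, and the original program may have differed in detail), the standard normal deviates of k independent results are summed and divided by the square root of k:

```python
import math


def stouffer_z(z_values):
    """Combine independent standard normal deviates: sum(Z) / sqrt(k)."""
    k = len(z_values)
    if k == 0:
        raise ValueError("at least one Z value is required")
    return sum(z_values) / math.sqrt(k)


if __name__ == "__main__":
    # Illustration only: the Z values of the first three results in Table 1.
    print(stouffer_z([3.62, 3.48, 12.18]))
```

Because the sum grows with k while the divisor grows only with its square root, many modest but consistent results can yield a very large combined Z, which is why the effect size is reported alongside it.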

The coefficient of robustness (Rosenthal, 1990) is the recipro- 
cal of the coefficient of variation and provides an index of the 
stability and replicability of the average effect size. It does not 
increase with the increasing number of replications. Robust- 
ness increases as the variance in effect sizes decreases and as the 
distance of the mean effect size from zero increases. Although 
comparative data do not exist yet, we hope that researchers will 
report this statistic in the future. 
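
The definition above (mean effect size divided by the standard deviation of the effect sizes) can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption; the text does not say which form was used.

```python
from statistics import mean, stdev


def coefficient_of_robustness(effect_sizes):
    """Reciprocal of the coefficient of variation: M / SD.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    return mean(effect_sizes) / stdev(effect_sizes)


def robustness_from_summary(mean_r, sd_r):
    """Same statistic computed from an already reported mean and SD."""
    return mean_r / sd_r


if __name__ == "__main__":
    # Table 4, 30-60 s exposure: unweighted r = .44, SD = .14.
    print(round(robustness_from_summary(0.44, 0.14), 2))  # 3.14
```

The value 3.14 agrees with the robustness coefficient listed for that category in Table 4. Adding replications with the same mean and spread leaves the ratio unchanged, which is the sense in which the statistic does not grow with the number of studies.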

Of the 38 results analyzed here, 11 were coauthored by Robert
Rosenthal, 10 were authored by Rosenthal’s former students,
and 17 were from other laboratories. We compared these results
and found that for studies coauthored by one of us the effect size r
was .38 (Z = 10.79), for studies authored by Rosenthal’s former
students the effect size r was .40 (Z = 16.08), and for studies 
from other laboratories the effect size r was .39 (Z = 12.71). 
These effect sizes were very homogeneous, F(2, 35) = .04, 
p>.li. 


File Drawer Analysis 


Results that fail to reach statistical significance are less likely 
to have been published (Rosenthal, 1984). A number of such 
studies might have accumulated in researchers’ file drawers. 
Simple calculations (Rosenthal, 1984, 1987) show that, for the
overall result of this meta-analysis to become nonsignificant
(p > .05), there would have to be 7,110 studies with a mean
probability of .50 languishing in file drawers.
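
The calculation is not spelled out in the text. A minimal sketch, assuming Rosenthal's fail-safe N formula, X = (ΣZ)² / 2.706 − k, where 2.706 is the square of the one-tailed critical Z of 1.645 and ΣZ is recovered from the combined Stouffer Z as Z × √k (the function name is hypothetical):

```python
import math


def fail_safe_n(stouffer_z, k, z_crit=1.645):
    """Number of unretrieved null results (mean Z = 0) needed to bring the
    combined result down to the critical Z.

    Assumes the formula X = (sum of Z)^2 / z_crit^2 - k.
    """
    sum_z = stouffer_z * math.sqrt(k)  # undo the division by sqrt(k)
    return (sum_z / z_crit) ** 2 - k


if __name__ == "__main__":
    # Values reported in this article: combined Z = 22.56 over 38 results.
    print(fail_safe_n(22.56, 38))
```

With the rounded values reported here, the sketch gives roughly 7,109, in line with the figure of 7,110 above; the small gap is presumably due to rounding of the combined Z.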


Additional Analyses 


Correlational analyses revealed that neither the number of 
judges or subjects nor their gender was significantly related to 
the accuracy of predictions. The latter result was surprising
because previous studies indicated that women tended to be 
more accurate interpreters of nonverbal cues than men (Hall, 
1984). But this superiority is also found to depend on the type 
of cue; women are better at decoding the less but not the more 
“leaky” channels (Rosenthal & DePaulo, 1979). Because this 
meta-analysis included studies using all the various channels, 
the superiority effect for women might have been canceled out. 

For further analyses, studies were classified and compared 
along various categories of interest described below. These cate- 
gories and their respective results are displayed in Table 4. 

Length of exposure. A central issue in this review was 
whether increasing the length of the audio or video clips rated 
would increase the accuracy in predicting criterion variables. 
One of our most important findings in terms of its methodological
implications was that effect sizes for studies with varying
lengths of exposure (less than 30 s, 30-60 s, 60-120 s, 120-180 s,
180-240 s, or 240-300 s) were very close, ranging from .24 to .45
(all associated with statistically significant Zs). A linear contrast
testing the effect of exposure length on mean effect sizes
was not significant (Z = .03), indicating that accuracy does not 
increase with longer exposures. A second contrast compared 
the effect size for exposures under 30 s to the other five effect 
sizes. This contrast was also not significant (Z = —.11). This 
suggests that judgments from very brief segments of behavior 
(under half a minute in length) may be as accurate as judgments 
from longer segments (up to 5 min long). Longer exposures do 
not seem to increase accuracy significantly. 

Type of outcome. Classification of studies by the type of
outcome predicted revealed an average effect size of .41 for
studies with clinical outcomes. Studies with general outcomes
in the area of social psychology had an average effect size of .47,⁶
and the effect size for those with outcomes specifically related
to the accuracy of detecting deception was .31. On the whole,
the type of outcome predicted apparently is not significantly
related to the accuracy of predictions, as shown by the results of
a linear contrast done on the effect sizes (Z = −.26; weights for
the contrast of −1 for clinical outcomes, 0 for social, and 1 for
deception studies reflected the amount of experimental control
over the stimuli).

Type of study. Combined effect sizes obtained for studies in 
which subjects’ behavior had been experimentally manipulated 
in the laboratory were compared with those obtained from 
Table 3
Statistical Summary of 38 Results (45 Studies)

| Statistic | Value |
|---|---|
| Central tendency (r) | |
| Unweighted | .39 |
| Weighted | .41 |
| Proportion > 0.00 | 100.0% |
| Significance tests | |
| Combined Stouffer Z | 22.56 |
| t testᵃ for mean r | 11.39 |
| Variability (r) | |
| Maximum | .87 |
| Quartile 3 (Q3) | .52 |
| Median (Q2) | .30 |
| Quartile 1 (Q1) | .22 |
| Minimum | .10 |
| Q3 − Q1 | .30 |
| σ̂ [.75(Q3 − Q1)] | .23 |
| S | .19 |
| S/√N (SE) | .03 |
| Robustness (M/SD) | 2.01 |
| CI for rᵇ | |
| 95% | .34-.48 |
| 99% | .31-.51 |
| 99.9% | .28-.54 |

ᵃ $t = \bar{Z}_r / \sqrt{(1/38)\,SD^2}$. ᵇ CI = confidence interval (n = 38).


nonexperimental studies in which subjects’ behavior was al- 
lowed to vary naturally. A contrast on the effect sizes (experi- 
mental studies r = .32, field studies r = .47) showed that the 
degree of control used in the study was not significantly related 
to the accuracy of predictions. Note that most of the experimen- 
tal studies were in the area of detecting deception, and most of 
the nonexperimental studies related to the prediction of clini- 
cal criterion variables. As far as they go, then, these results do 
not overwhelmingly support the proposal that observations 
from natural situations should be more accurate than observa- 
tions from laboratory situations (Funder, 1987), although it is 
interesting that the effect size for field studies was higher than 
the effect size of laboratory studies. 

Type of behavior. For this analysis, results were categorized 
according to whether only nonverbal behavior was rated or 
whether verbal behavior was also included (usually speech or 
transcripts). Again a contrast on the effect sizes indicated that 
there were no significant differences between the two catego- 
ries (the correlation for nonverbal alone was .45; for verbal and 
nonverbal, the correlation was .35). Both effect sizes were asso- 
ciated with significant Zs. 


Results for Different Channels 


Table 5 contains the results for different channels rated 
across all studies. There are 65 results because in some studies 


6 This category included 6 studies related to the social psychology of
education. The Z for the education-related studies was 9.20, with a 
corresponding effect size r of .56; Z for the 5 studies with pure social 
psychological outcomes was 9.93, and the r was .35. These results were 
not significantly different (Z = .29), so they were combined under the 
category of social psychology. 
Table 4
Results of Studies Classified by Length of Exposure, Type of
Outcome, Type of Study, and Type of Behaviors Rated in the Study

| Category | n | Z | r | Trimmed Z | Trimmed r | Weighted r | SDᵃ | Rob |
|---|---|---|---|---|---|---|---|---|
| Exposure lengthᵇ | | | | | | | | |
| 0-30 | 7 | 5.50 | .33 | 4.51 | .26 | .25 | .21 | 1.57 |
| 30-60 | 7 | 10.38 | .44 | 6.10 | .41 | .57 | .14 | 3.14 |
| 60-120 | 16 | 15.03 | .39 | 13.35 | .35 | .36 | .23 | 1.70 |
| 120-180 | 1 | 5.12 | .24 | - | - | - | - | - |
| 180-240 | 3 | 12.28 | .39 | 4.26 | .27 | .40 | .23 | 1.70 |
| 240-300 | 4 | 5.28 | .45 | 3.84 | .49 | .39 | .13 | 3.46 |
| Type of outcome | | | | | | | | |
| Clinical | 11 | 10.69 | .41 | 8.44 | .35 | .42 | .22 | 1.86 |
| Deception | 16 | 14.83 | .31 | 12.09 | .30 | .31 | .15 | 2.07 |
| Social | 11 | 13.36 | .47 | 10.40 | .47 | .52 | .21 | 2.24 |
| Type of study | | | | | | | | |
| Field | 17 | 13.89 | .47 | 12.17 | .44 | .44 | .21 | 2.24 |
| Laboratory | 21 | 17.85 | .32 | 15.55 | .32 | .40 | .16 | 2.00 |
| Behaviors rated | | | | | | | | |
| NV | 15 | 14.42 | .45 | 12.68 | .42 | .39 | .24 | 1.88 |
| V + NV | 23 | 17.35 | .35 | 15.09 | .34 | .42 | .16 | 2.19 |

Note. Rob = coefficient of robustness, NV = nonverbal, V = verbal.
Dashes signify nonapplicable.
ᵃ Based on r. ᵇ In seconds.


ratings were made separately for different channels. However, 
only one result was coded per channel for each study. The effect 
sizes ranged from .26 for ratings that were based on tone of voice
to .54 for ratings that were based on the face and body. All effect 
sizes were associated with significant Zs, and linear contrast 
analyses revealed no significant differences between them— 
contrasts compared the individual channels (face, body, and 
speech) to the combined channels, because the effect for the 
combined channels was expected to be larger. Judges appar- 
ently were most accurate when they were able to observe the 
face and the body (r = .54). Across all studies, the level of accuracy
declined to .28, though not significantly so (Z = −.72),
when speech was added to the face and body. However, this
issue could also be looked at within studies. In three studies, 
some judges saw just the face and body whereas others were 
exposed to the face, body, and speech; these studies found that 
ratings that were based on the face, body, and speech (r = .42) 
were more accurate than ratings that were based on just the 


Table 5
Accuracy of Judgments for Different Channels

| Channel | n | Z | r |
|---|---|---|---|
| Body | 2 | 2.32 | .28 |
| Face | 5 | 7.32 | .40 |
| Speech | 8 | 8.20 | .36 |
| Tone of voice | 12 | 7.66 | .26 |
| Transcripts | 6 | 4.75 | .29 |
| Body + speech | 2 | 2.92 | .33 |
| Face + body | 12 | 16.24 | .54 |
| Face + speech | 3 | 9.17 | .41 |
| Face, body, and speech | 15 | 9.54 | .28 |


face and the body (r = .24), although the two were not quite 
significantly different (Z = 1.56). Excluding the studies in 
which ratings had been obtained on the basis of the face and 
body both with and without speech, ratings of just the face and 
body were more accurate (r = .62) than those based on the face, 
body, and speech (r= .24), although again the difference was not 
significant (Z = —.91), suggesting that too much information 
was confusing or distracting to the judges. For the individual 
channels, ratings that were based on the face alone (r = .40) or 
on speech alone (r = .36) were slightly though not significantly 
more accurate than ratings that were based just on transcripts 
(r = .29), the body (r = .28), or tone of voice (r = .26). 

Previous reviews focusing only on the accuracy of detection 
of deception had indicated that verbal content improved accu- 
racy considerably (DePaulo et al., 1980; Zuckerman et al., 
1981). These reviews included some studies with brief clips. To 
find out whether the presence of substantive verbal content 
increased accuracy in the present review, results were combined 
across channels for the presence or absence of speech. These are 
shown in Table 6. 

The presence or absence of speech apparently does not signif- 
icantly affect judgmental accuracy when exposures to stimuli 
are brief. The difference between our results and the results of 
the previous meta-analyses may be due to the fact that the ma- 
jority of the studies included in the latter used longer clips of 
behavior. When observations are longer, cues from the verbal 
channel might be more informative and lead to greater predic- 
tion accuracy. 


Comparisons Within Areas 


We also examined the effect of length of observation, the 
behaviors rated, and the type of study within the clinical, 
deception, and social studies. These results are presented in 
Table 7. 

Note that there are missing data for some of the columns, and 
some of the results are based on single studies. In the clinical 
area, there were no significant differences between studies with 
differing lengths of exposure (Contrast Z = .08) or different 
channels (Contrast Z = —.17). Within the social area, there were 
no differences between various lengths of exposure (Z = .04), 
channels (Z = —.07), or the type of study (Z = —.19). Similarly, 
for deception studies, there were no significant differences in 
exposure lengths (Z = .06) or different channels (Z = .05). 


Table 6
Presence of Content and Accuracy of Judgments
for Different Channels

| Content | Face (F) | Body (B) | F + B | M |
|---|---|---|---|---|
| Present (speech) | .41 | .33 | .28 | .34 |
| N | 3 | 2 | 15 | |
| Absent (no speech) | .40 | .28 | .54 | .41 |
| N | 5 | 2 | 12 | |
| M | .41 | .31 | .41 | .38 |
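Table 7
Comparing Length of Exposure, Type of Study, and Type of Behavior
Rated Within Different Outcome Areas

| Category | Clinical n | Clinical Z | Clinical r | Social n | Social Z | Social r | Deception n | Deception Z | Deception r |
|---|---|---|---|---|---|---|---|---|---|
| Exposure lengthᵃ | | | | | | | | | |
| 0-30 | 1 | 0.85 | .16 | 4 | 4.65 | .39 | 2 | 3.13 | .27 |
| 30-60 | 3 | 4.67 | .40 | 2 | 10.79 | .56 | 2 | 2.91 | .36 |
| 60-120 | 6 | 9.88 | .43 | 3 | 7.78 | .48 | 7 | 8.48 | .30 |
| 120-180 | - | - | - | - | - | - | 1 | 5.12 | .24 |
| 180-240 | - | - | - | - | - | - | 3 | 12.28 | .39 |
| 240-300 | 1 | 2.32 | .47 | 2 | 4.40 | .52 | 1 | 1.95 | .26 |
| Type of study | | | | | | | | | |
| Field | 10 | 10.03 | .43 | 7 | 9.65 | .53 | - | - | - |
| Laboratory | 1 | 3.72 | .22 | 4 | 9.39 | .36 | 16 | 14.38 | .31 |
| Behavior rated | | | | | | | | | |
| NV | 7 | 9.88 | .46 | 5 | 7.43 | .50 | 3 | 7.57 | .32 |
| V + NV | 4 | 4.67 | .30 | 6 | 11.30 | .45 | 13 | 12.81 | .31 |

Note. Dashes signify nonapplicable. NV = nonverbal, V = verbal.
ᵃ In seconds.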


Thin Compared With Thick Slices of Behavior 


At this stage, the reader might wonder how predictions from 
thin slices of behavior compare with those made from other 
sources. We could not locate studies that have compared pre- 
dictions from brief clips directly with predictions from other 
sources. However, a few studies have used other methods to 
predict criterion variables similar to the ones predicted from 
brief clips. Wiggins (1973) discusses some of the classic Ameri- 
can “milestone” studies on predicting criterion variables by 
various evaluation and assessment methods—including inter- 
views, self-report, and projective personality assessment mea- 
sures. We compared the results of this meta-analysis with the 
results of the classic milestone studies reported by Wiggins as 
well as with a few other studies that have used criterion mea- 
sures similar to ours. 

Our first comparison was with Wiggins’s (1973) summary of 
the results of the Office of Strategic Services (OSS) assessment 
study. The OSS study was undertaken in 1948 in an effort to 
predict the effective performance of OSS officers. Predictions 
were based on a variety of psychological tests, situational perfor- 
mance measures, and interviews. The criterion variables in- 
cluded appraisals on the job by superiors and assessment staff. 
Wiggins reported the correlations between various assessment 
and appraisal ratings (Table 11.4, p. 534). We combined the 
correlations and calculated that the average correlation be- 
tween predictor and criterion variables was .26. Contrasting 
this effect size to the effect size of the present study (r = .39) 
revealed no significant difference in the accuracy of prediction 
between the two studies (Z = .62). Results comparing the effect 
size obtained in the present meta-analysis with results from 
other studies are displayed in Table 8. 

Our second comparison was with the results of another mile- 
stone study reported by Wiggins (1973). This is the Veterans 
Administration (VA) assessment project conducted between
1946 and 1949 to evaluate procedures used to select clinical
psychologists. Clinical trainees were evaluated at various test- 
ing centers by staff judges using a variety of psychological tests 
and interviews, and these assessments were related to a number 
of different criterion measures including performance ratings 
from university departments, analyses of examination perfor- 
mance, field tests of work samples, and ratings by supervisors 
and colleagues. Again, our computations are based on results 
reported by Wiggins (Table 11.11; p. 563). Results comparing 
the average correlation obtained in the VA study from the 
pooled assessment of the judges and from the best psychological
test predictors are reported separately in Table 8. Again,
contrast analyses revealed that these effect sizes did not differ 
significantly from the effect size obtained in the present study. 

Our third example is also with research in clinical psychol- 
ogy and is based on the work of Holt and Luborsky (1958). This 
study is one of the milestone studies reported by Wiggins
(1973), but we compared our results with those from a table not 
included in his write-up of the study. Holt and Luborsky stud- 
ied over 200 psychiatric residents at the Menninger School of 
Psychiatry, using several different methods to predict psychiat- 
ric competence. After considerable deliberation, they selected 
supervisors’ evaluation of overall competence as the major crite- 
rion variable. Another criterion variable they used was peer 
ratings of competence. Our comparison is based on the third of 
their three studies, which cross-validated predictors from the 
previous two studies. Four judges’ ratings of about 65 residents, 
using different methods of evaluation (such as analyses of appli- 
cation materials, interviews, Thematic Apperception Test, and 
Rorschach protocols), were correlated with peer and supervisor 
ratings of residents’ competence. Judge ratings were correlated 
with supervisor and peer ratings of the residents on psychother- 
apy competence, diagnostic competence, management compe- 
tence, and overall competence. The average effect size relating 
judges’ overall ratings and the criterion variable of supervisor 
Table 8
Contrasting Effect Sizes (r) That Are Based on “Thick” Slices of Behavior
With Effect Sizes From the Present Study

| Study / predictor variable | Criterion variable | r | Contrast Z | Contrast p |
|---|---|---|---|---|
| OSS studyᵃ | Job performance appraisal ratings | .26 | 0.62 | .27 |
| VA assessment studyᵇ | Clinical competence | | | |
| Best objective test | | .28 | 0.56 | .29 |
| Pooled ratings | | .29 | 0.56 | .29 |
| Menninger studyᶜ: judges’ ratings and psychological tests | Clinical competence | .31 | 0.17 | .44 |
| Meta-analysisᵈ | Teaching effectiveness (students’ ratings based on classroom performance) | | | |
| Self-reports | | .07 | 1.36 | .09 |
| Students’ ratings | | .41 | 0.06 | .47 |
| Colleagues’ ratings | | .33 | 0.19 | .42 |
| Meta-analysisᵉ: judges’ ratings | Deception detection | .32 | 0.12 | .45 |
| Meta-analysisᶠ: judges’ ratings | Deception detection | .45 | 0.54 | .29 |
| Medianᵍ | | .31 | | |

Note. OSS = Office of Strategic Services, VA = Veterans Administration.
ᵃ Wiggins (1973), p. 534. ᵇ Wiggins (1973), p. 563. ᶜ Holt and Luborsky (1958), p. 213. ᵈ Feldman
(1986). ᵉ DePaulo, Zuckerman, and Rosenthal (1980; excluding studies overlapping with the present
meta-analyses). ᶠ Zuckerman, DePaulo, and Rosenthal (1981; excluding studies overlapping with the
present meta-analyses). ᵍ The median is based on five independent results. We computed one effect size
each for the VA assessment study and Feldman’s (1986) meta-analysis. DePaulo et al.’s (1980) and Zuckerman,
DePaulo, and Rosenthal’s (1981) meta-analyses were combined because they were not completely
independent.




ratings was .31; the average effect size relating judge ratings and 
peer ratings was .30. These effect sizes are not significantly 
different from the overall effect size or the effect sizes of just 
the clinical outcomes in the present study, as can be seen in 
Table 8. Thus, ratings from thin slices of behavior apparently
predict certain clinical criteria as well as more complicated and 
lengthy methods, as was advocated by Carl Rogers and his asso- 
ciates. 

Fourth, we compared our results with a study on college 
teacher effectiveness. The most commonly used criterion mea- 
sure for teacher effectiveness is student evaluations, a measure 
with high ecological validity because it is used for promotion, 
hiring, and tenure decisions. In a meta-analysis of the extant
literature relating teacher effectiveness to aspects of teacher personality,
Feldman (1986) compared college teacher effectiveness
ratings that were based on student evaluations with (a)
self-report measures of personality, (b) student ratings of
teachers’ personality, and (c) colleague ratings of teacher personality.
He conducted 14 separate meta-analyses on 14 broad
personality traits evaluated in the literature. Averaging across
these traits, we found that the average correlation with teacher
effectiveness was as follows: (a) for self-report measures, r = .07
(Z = 3.05); (b) for students’ ratings of personality, r = .41 (Z =
13.71); and (c) for colleagues’ ratings of personality, r = .33 (Z =
8.37). Contrast analyses comparing the average effect size from
the present meta-analysis with these three effect sizes individually
as well as in combination (r = .28) were not significant.
Ratings from thin slices of behavior are apparently as good a
predictor of teaching effectiveness as other measures. This result
is quite surprising because colleagues and students have
access to so much more information about the subject than 
judges viewing clips of behavior under 300 s in length. 

Our final comparison was with two meta-analyses on the 
accuracy of detecting deception. In both studies, no overall 
effect size was reported for accuracy, but effect sizes were re- 
ported for the different behavioral channels and different com- 
binations of channels. It was not possible for us to compute 
overall effect sizes from the information provided in the stud- 
ies. We were, however, able to compare the effect sizes for the 
accuracy of detection of deception from the face, body, and 
speech channels for both studies (leaving out results included in 
our analysis) with that found in the present analysis. In the first 
meta-analysis, the effect size for 6 studies was .32 (DePaulo et 
al., 1980). In the second one, the effect size for 17 studies on the 
accuracy of deception detection from observations of the face, 
body, and speech channels was .45 (Zuckerman et al., 1981). 
The corresponding effect size in the present analysis was .24. 
Again, contrast analyses revealed no significant differences be- 
tween the accuracy of detecting deception among the three 
results (Z = .12 with DePaulo et al., 1980, Z = .54 with Zucker- 
man et al., 1981). This result suggests that thin slices of behav- 
ior may be used to predict deception about as accurately as 
longer observations do. 

The combined median effect size of all the studies using
thick slices of behavior (using only one entry per study and
combining DePaulo et al.’s, 1980, and Zuckerman, DePaulo, &
Rosenthal’s, 1981, meta-analyses because they are not completely
independent) was .31. This figure is very close to the
effect size of .39 obtained from thin slices of behavior.


Discussion 
Summary of Findings 


Thin slices of behavior provide a great deal of information 
and permit significantly accurate predictions. The effect size of 
.39 for the overall accuracy of prediction from observations of 
less than 5 min is higher than most of the effect sizes found in 
social and personality psychology (Cohen, 1988). An r of .39 
with the criterion, according to Rosenthal and Rubin's bino- 
mial effect-size display, means that correct classifications can 
be made using thin slices of behavior nearly 70% of the time, 
compared with about 30% of the time when no thin slices are 
available (Rosenthal & Rubin, 1982). Furthermore, the thinness
of the slice does not seem to affect the accuracy of predictions:
Judgments from under 30 s of observation were as accurate
as those made from 5-min observations. Indeed, the level
of accuracy did not differ significantly between 30-s, 1-, 2-, 3-, 4-,
and 5-min-long observations. Moreover, the accuracy of predictions
from thin slices of behavior did not differ significantly
from the accuracy of predictions that were based on lengthier 
observations of behavior, such as those in some of the classic 
studies on the prediction of behavior. In fact, other studies 
using a number of different measures to predict aspects of per- 
sonality have found effect sizes ranging from .30 to .40 (Funder 
& Ozer, 1983). This result contradicts the commonsense notion 
that more information leads to greater accuracy; the additional 
information might be redundant, or even counterproductive 
(Wilson & Schooler, 1991). 
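
The 70% versus 30% figures cited above follow from the binomial effect-size display, which re-expresses a correlation r as the difference between two success rates centered on 50%. A minimal sketch (the function name is hypothetical):

```python
def besd(r):
    """Binomial effect-size display: success rates of .50 + r/2 and .50 - r/2."""
    return 0.5 + r / 2.0, 0.5 - r / 2.0


if __name__ == "__main__":
    with_slices, without_slices = besd(0.39)
    print(round(with_slices, 3), round(without_slices, 3))  # 0.695 0.305
```

For r = .39 the display gives 69.5% against 30.5%, which the text rounds to nearly 70% and about 30%.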

Although specific behaviors exhibited within a situation 
might vary considerably, it appears that some stable underlying 
essence is picked up by judges. The consistency of predictions 
that are based on thin slices of behavior indicates that the 
“something” in the nature of people that, according to Allport 
(1937), leads observers to perceive them in a certain way is 
communicated through their expressive behavior. Individuals 
might not be perceived in exactly the same manner from one 
observation to another because of some degree of variability in 
their behavior; however, raters’ relative ranking of individuals 
seems to be fairly stable (Kenrick & Funder, 1988). The data 
presented in this article indicate that judgments across differ- 
ent thin slices of behavior are quite consistent, although some 
people are easier to judge than others, as has been illustrated by 
findings regarding the “demeanor bias” (Kraut, 1982; Riggio, 
Tucker, & Widaman, 1987; Zuckerman, Larrance, Hall, De- 
Frank, & Rosenthal, 1979), and some people are better judges 
than others (Rosenthal et al., 1979). 


Tentative Explanations 


Why are judgments from thin slices of behavior so accurate? 
We can suggest a few tentative explanations that are not mutually 
exclusive. The first explanation is derived from the ecological 
approach to social perception. The second draws on evidence 
regarding the kernel of truth to stereotypes and the effect 
of self-fulfilling prophecies. The third explanation is based on 
evidence regarding the disruptive effects of thinking and rea- 
soning. 

The first explanation is suggested by McArthur and Baron's 
(1983) ecological approach to social perception. Certain attrib- 
utes such as anger, fear, or dominance might be quickly and 
easily recognizable because they are more essential for survival 
and adaptive action. On the other hand, attributes like reliabil- 
ity or humor may be harder to detect because they are less 
essential for immediate survival and adaptation to the environment 
and require more inferential processes to identify them. 
Detection of an attribute also depends on the context in which the 
target is being observed. For example, honesty may be 
easier to detect when observing salesmen than when observing 
teachers. 

Zajonc (1980, 1984) also suggested that immediate affective 
reactions to stimuli precede cognitive and perceptual operations. 
His explanation, similar to the ecological approach, was 
that these reactions result from a primitive 
neurological system that allows for quick analyses and rapid 
action in case of threats to the organism. The initial affective 
reactions are based on qualities that automatically draw atten- 
tion and can be evaluated at a preconscious level for favorabi- 
lity-unfavorability. In judging personal attributes, it is likely 
that attributes relating to affect, emanating from expressive be- 
havior, allow for quick processing along a pleasant-unpleasant 
or a safety-threat dimension. Evidence for the preattentive pro- 
cessing of angry faces as opposed to happy faces in a crowd of 
dissimilar faces, presumably because of a preattentive search 
for threat, suggests this hypothesis might be true (Hansen & 
Hansen, 1988). Recent research on the automaticity of social 
information processing indicates that we do react in an auto- 
matic, affective, and evaluative manner to social stimuli (Isen, 
1984) and spontaneously and automatically categorize social 
information into traits (Smith & Miller, 1983; Srull & Wyer, 
1979; Winter & Uleman, 1984). This process seems to be auto- 
matic, in that this categorization occurs even when trait stimuli 
are presented subliminally (Bargh, 1988; Bargh & Pietromon- 
aco, 1982). For example, in some of these experiments, subjects 
who were not even aware that a trait word had been flashed 
outside the visual foveal field still rated a target person as pos- 
sessing more of the trait than control subjects (Bargh, Bond, 
Lombardi, & Tota, 1986; Bargh & Pietromonaco, 1982). Other 
studies have suggested that this unintentional processing is 
more likely to occur in the case of social stimuli that are 
automatically evaluated as “good” or “bad” (Fazio, Sanbonmatsu, 
Powell, & Kardes, 1986). Because of some kind of non- 
conscious “tacit knowledge” (Polanyi, 1966), judges seem to be 
able to rate very brief exposures of behavior fairly accurately on 
various affective and evaluative dimensions. 

Our findings indicate clearly that certain affective, interper- 
sonally oriented dimensions of personality can be judged quite 
rapidly, efficiently, and accurately. It is certainly possible that 
judgments of these dimensions that are based on thin slices of 
behavior are accurate because recognition of these dimensions 
may be more important for survival and adaptation to the environment. 
Furthermore, these dimensions seem to be revealed 
through various channels of expressive behavior. 

One problem with applying this explanation to the present 
findings is that the judgments made by subjects in the studies 
reviewed here were not completely spontaneous. We cannot say 
whether these judgments would have been made if judges had 
not been instructed to do so. Some studies have indicated that 
subjects make trait attributions only if they are primed or in- 
structed to do so (Bassili, 1989; Bassili & Smith, 1986; Hastie & 
Pennington, 1989). The stimuli in all these studies, however, 
consisted of descriptions of behavior. When subjects have access 
to actual behavior, as in the studies reviewed in 
this meta-analysis, we believe that automatic evaluation does 
occur. 

A second, somewhat related explanation is that initial judg- 
ments are influenced by the activation in memory of common 
stereotypes that might possess a kernel of truth (Baron & 
Boudreau, 1987; McArthur, 1982; Watson, 1989). Evidence for 
the kernel of truth in perceptions that are based on stereotypes 
comes from studies relating targets’ physical characteristics to 
judgments of various attributes of their personality (Berry, 
1990; Berry & Brownlow, 1989; Berry & McArthur, 1985; 
Brownlow & Zebrowitz, 1990; McArthur & Montepare, 1989; 
Raines, Hechtman, & Rosenthal, 1990). For example, Berry 
and McArthur (1985) found that adults with baby faces were 
perceived as more honest and naive, and as warmer and kinder, than 
more mature-faced adults. Likewise, baby-faced adults guilty of 
criminal acts were given lighter sentences (Berry & Zebrowitz- 
McArthur, 1988). There is also evidence for stereotyping on the 
basis of vocal characteristics. People are able to discriminate 
vocal attractiveness and attribute dispositional characteristics 
to people accordingly (Zuckerman & Driver, 1989; Zuckerman, 
Hodgins, & Miyake, 1990). 

Furthermore, evidence linking biological, physical, and tem- 
peramental attributes is increasing. Findings indicate that 
stable differences in the presence of certain traits in infants and 
children, specifically traits associated with behavioral inhibition, 
are potentially related to differences in underlying biological 
factors, especially in processes that originate in the limbic 
system (Kagan, Reznick, Clarke, Snidman, & Garcia-Coll, 
1984; Kagan, Reznick, & Snidman, 1988; Reznick et al., 1986). 
Furthermore, these differences have also been associated with 
differences in physical characteristics in adults. Thus shyer, 
more inhibited men seem to have more lightly colored eyes and 
more ectomorphic physiques than more sociable men (Her- 
bener, Kagan, & Cohen, 1989; Rosenberg & Kagan, 1987; also 
see Sheldon, Stevens, & Tucker, 1940). Attempts are being 
made to explain why these biological, physical, and tempera- 
mental characteristics might be intercorrelated (Kagan, 1989). 

Activation of these physically based stereotypes probably 
creates expectations in others that influence the behavior of the 
target individuals. Research has shown that our expectations 
affect our behavior toward others, which in turn modifies their 
behavior to confirm these expectations, creating a self-fulfill- 
ing prophecy (Anderson & Bem, 1981; Curtis & Miller, 1986; 
Rosenthal & Jacobson, 1968; Snyder et al., 1977). It is therefore 
possible that through processes such as behavioral confirma- 
tion or self-verification (Snyder et al., 1977; Swann & Read, 
1981), people develop a repertoire of behaviors and a style of 
interacting that validate and confirm their own and others’ 
expectations that are based on their physical characteristics. 
Thus, physically attractive people who are judged to possess 
more socially desirable personality traits may internalize these 
expectations and may actually become more socially skilled, 
likeable, and confident (Adams, 1977; Berscheid & Walster, 
1974; Dion, 1986; Goldman & Lewis, 1977). Similarly, baby- 
faced people may behave in more naive and less dominant ways 
because people expect them to behave in such ways, internaliz- 
ing this view of themselves and thereby validating the kernel of 
truth in the stereotype. 

A third explanation could be that predictions that are based 
on thin slices may be accurate because of the absence of dis- 
tracting stimuli. Research has indicated that subjects involved 
in face-to-face interactions with targets were less accurate in 
their judgments than subjects who formed impressions from 
videotapes of the targets (Gilbert & Krull, 1988; Toris & De- 
Paulo, 1984). When people are involved in actual interactions, 
they may be distracted by factors such as the verbal component 
of the interaction or the demands of impression management 
and self-presentation. Besides distracting external stimuli, dis- 
tracting internal processing might also decrease the accuracy of 
judgments. Too much thinking and reasoning can sometimes 
be disruptive of judgmental accuracy. People make better affec- 
tive judgments and decisions when they introspect less and do 
not seek reasons to explain their feelings (Wilson, Dunn, Kraft, 
& Lisle, 1989; Wilson & Schooler, 1991). Judgments that are 
based on thin slices of behavior may be accurate precisely be- 
cause they are snap judgments. Note that this explanation con- 
tradicts the assumption of the first explanation that people can 
screen out distracting stimuli and focus on dimensions critical 
to the context in which the judgment is being made. 

We are not convinced that any one of these theories alone can 
explain our findings. Aspects of each of these seem to influ- 
ence judgments that are based on thin slices of behavior. It 
seems likely that the thinness of the slice eliminates distracting 
stimuli and enables judges to focus on expressive behavior. It is 
also likely that judgments regarding certain dimensions are 
accurate because we are used to rapidly making judgments that 
enable survival and adaptation to the environment. Furthermore, 
judgments that are based on thin slices of behavior may 
be accurate because they activate stereotypes that are themselves 
accurate, given that the social world operates to reinforce and maintain 
certain patterns of behavior in people. 


Implications and Conclusions 


These findings have theoretical implications for the debate 
in personality psychology regarding the consistency of behav- 
ior. This debate has been summarized elsewhere (Kenrick & 
Funder, 1988; Ross & Nisbett, 1991), and we do not discuss it in 
detail here. In a review of the issues in this debate, Kenrick and 
Funder identified the circumstances under which behavior can 
be predicted from trait ratings. It can be predicted when (a) 
publicly observable dimensions are rated, (b) raters are familiar 
with the target, (c) multiple raters are used, (d) multiple observations 
of the target are made, and (e) behaviors relative to the 
dimensions rated are being predicted (Kenrick & Funder, 
1988). The studies included in this meta-analysis met all but 
two of the criteria for the accurate prediction of behavior. First, 
the raters were not familiar with the target, and second, in a 
number of the studies surveyed multiple behavioral observations 
were not made. These exceptions have important implications 
for the issue of behavioral consistency. Accurate predictions, 
even ones based on only thin slices of behavior, might not 
require familiarity with the target or multiple observations of 
behavior if the dimensions evaluated are truly relevant to the 
outcome being predicted (high validity) and if, for the most part, 
there is good agreement among the raters (high reliability). 
The importance of carefully selecting the traits and behaviors to be 
judged is highlighted by these results because certain traits are only revealed 
in and are only relevant to certain situations (Allport, 1966; 
Bem & Funder, 1978; Epstein, 1979; Funder & Dobroth, 1987; 
Kenrick & Funder, 1988; Kenrick, McCreath, Govern, King, & 
Bordin, 1990). For example, the low validity of unstructured 
interviews in predicting job performance, college success, and 
professional success (Hunter & Hunter, 1984) can be attributed 
to the inadequate sampling of truly relevant behaviors (Ross & 
Nisbett, 1991). Therefore, the relevance, representativeness, 
and ecological validity of the behavior as well as the outcome 
measures are important for accurate prediction. To the degree 
that situations overlap and individuals are consistent in their 
style of behavior across different situations, these predictions 
should be generalizable across situations (Allport, 1937; Ep- 
stein, 1979; Kenrick & Stringfield, 1980). 
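
The role of multiple raters and inter-rater agreement can be made concrete with the Spearman-Brown formula, which estimates the reliability of the mean of n judges from the mean correlation between pairs of judges. This is a minimal sketch; the function name and the example values (up to 10 judges, a mean inter-judge correlation of .30) are hypothetical and are not taken from the studies reviewed here.

```python
def effective_reliability(n_judges, mean_r):
    """Spearman-Brown estimate of the reliability of the mean of n judges.

    n_judges: number of judges whose ratings are averaged.
    mean_r:   mean correlation between pairs of judges (hypothetical here).
    """
    if n_judges < 1:
        raise ValueError("n_judges must be at least 1")
    return n_judges * mean_r / (1 + (n_judges - 1) * mean_r)


# Hypothetical values: how pooled reliability grows with the number of judges.
for n in (1, 5, 10):
    print(n, round(effective_reliability(n, 0.30), 3))
```

Under these assumed values, pooling judges raises reliability well above that of a single judge, which is one way that aggregated ratings of even brief observations could reach useful levels of agreement.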

Related to this issue is the question of which types of behaviors 
or dimensions can be judged accurately by using ratings 
of thin slices of behavior. It would be unrealistic to suggest 
that brief observations can predict most clinical and social out- 
comes. Brief observations may be most appropriate in predict- 
ing criterion variables characterized by observability and affec- 
tivity. First, as stated by Kenrick and Funder (1988), the behav- 
iors or traits to be judged should be observable to permit 
reliability in ratings. Studies on the accuracy of personality 
judgments, using self-reports as a criterion and peer and 
stranger ratings as predictors, have generally found that observ- 
able traits and behaviors are more accurately judged than less 
observable ones (Albright et al., 1988; Funder & Colvin, 1988; 
Kenrick & Funder, 1988; Kenrick & Stringfield, 1980; Kor- 
etzky, Kohn, & Jeger, 1978; McCrae, 1982; Watson, 1989). Thus, 
traits such as extraversion and conscientiousness seem to be 
judged more accurately than traits such as emotional stability. 

Second, the dimensions or traits to be judged should include 
a substantial affective or interpersonally oriented component, 
such as teachers’ expectations or patient satisfaction with doc- 
tors. Although dimensions such as anxiety, dominance, shy- 
ness, or warmth might be revealed in brief observations, less 
interpersonal but more personal qualities such as con- 
scientiousness, intelligence, or persistence are probably more 
difficult to judge in this way. These affective, observable di- 
mensions seem to be the ones that need to be judged quickly for 
survival and adaptation to the environment (McArthur & 
Baron, 1983; Zajonc, 1980, 1984). These interpersonal variables 
can be assessed even when the segment of behavior to be 
judged does not show an interpersonal interaction but shows 
only one target person. Early research has also revealed that 
certain personality dimensions such as inhibition-impulsion, 
apathy-intensity, and ascendance-submission are judged more 
accurately from brief motion pictures than are dimensions 
such as creativity, interest in ideas and theories, or a liking for 
contemplative observation (Estes, 1938). This finding may have arisen 
because the latter dimensions were less observable, less interpersonally 
oriented, and less revealing of affect than the former 
dimensions. 

These findings also have several practical implications. First, 
researchers can save time (their own and that of their raters) and 
money by using thin slices of behavior to evaluate important 
affective variables, without sacrificing accuracy. Second, rat- 
ings of thin slices of behavior can be used to predict important 
criterion variables, particularly those that are interpersonally 
oriented. For example, these ratings might be used to identify 
biased teachers, assess aspects of the therapeutic process, or 
gauge the expectancies of various targets such as newscasters. 
Third, ratings of thin slices of behavior might be very useful in 
the selection, training, and evaluation of people who need 
strong interpersonal skills, such as managers, salespersons, 
teachers, and therapists. Fourth, the channel of communica- 
tion (verbal or nonverbal) does not seem to affect the accuracy 
of ratings when exposures are very brief, implying that ratings 
can be based on any channels that can be conveniently re- 
corded. 

In addition, these results provide further support for the 
accuracy of the layperson’s intuitive judgments (Funder, 1987; 
Kenny & Albright, 1987; Swann, 1984; Wilson & Schooler, 
1991). They reveal that we unknowingly encode and decode a 
great deal of information regarding various aspects of our- 
selves. Funder proposed two criteria to evaluate the accuracy of 
social judgments. First, do the judgments agree with each 
other? Second, do they predict behavior? Most of the research 
considered in this article meets both criteria. Overall, judges 
tended to agree with each other, and their ratings did indeed 
predict the criterion variables. Gordon Allport (1937) observed 
that 


a brief acquaintance often does result in amazingly rich impres- 
sions, many of which are proved on further acquaintance to be 
correct. Such successful judgments are significant because, lack- 
ing personal information, or a telltale context of conversation, the 
cues are derived entirely from expressive movements—from ap- 
pearance, gesture, and manner of speaking. (p. 500) 


This observation was confirmed by our findings. The probabi- 
listic expectancies we form about others from very limited in- 
formation are more accurate than we would expect. 